We focus on Web Data Commons - Training Dataset and Gold Standard for Large-Scale Prod- uct Matching dataset (WDC for short) prepared by the staff of the University of Mannheim (Primpeli et al., 2019).
Additionally, each offer is linked to a specific product (cluster id) and contains textual attributes such as title, description etc.
The training datasets are available in different sizes, varying from small to extra large. In every dataset, the ratio between positive and negative pairs is 1:3.
The proportions between the different collections within one category are as follows:
import pandas as pd
from eda_source import *
The data is provided as four different files - each per category: Computers, Cameras, Watches, and Shoes. This division is helpful as we plan to train separate models for each category, so no additional merging is required.
# Training data
df_computers = pd.read_json(path_or_buf='./data/computers_train_medium.json', lines=True)
Each observation is a pair of such offers and a label indicating whether these two offers are for the same product (a positive pair) or not (a negative pair). Even in the case of a negative pair, both offers belong to the same category (but different clusters/products).
print(f"In all categories there are the same features: \n {df_computers.columns.values}")
In all categories there are the same features: ['id_left' 'title_left' 'description_left' 'brand_left' 'price_left' 'specTableContent_left' 'keyValuePairs_left' 'category_left' 'cluster_id_left' 'identifiers_left' 'id_right' 'title_right' 'description_right' 'brand_right' 'price_right' 'specTableContent_right' 'keyValuePairs_right' 'category_right' 'cluster_id_right' 'identifiers_right' 'label' 'pair_id']
Feature meaning:
id: Unique integer identifier of an offer
cluster_id: The integer ID of the cluster (product) to which an offer belongs.
identifiers: A list of all identifier values that were assigned to an offer together with the schema.org terms that were used to annotate the values.
category: One of 25 product categories the product was assigned to, NaN if not part of the English subset.
title: The product title.
description: The product description.
brand: The product brand.
price: The product price.
specTableContent: The specification table content of the products website as one string.
keyValuePairs: The key-value pairs that were extracted from the specification tables using the method described above.
Note: the 'right' suffix represents a first offer in a pair, the 'left' suffix represents the other offer.
A positive pair: cluster_id_left == cluster_id_right
A negative pair: cluster_id_left != cluster_id_right
df_computers.head()
| id_left | title_left | description_left | brand_left | price_left | specTableContent_left | keyValuePairs_left | category_left | cluster_id_left | identifiers_left | ... | description_right | brand_right | price_right | specTableContent_right | keyValuePairs_right | category_right | cluster_id_right | identifiers_right | label | pair_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16876009 | 495906 b21 hp x5560 2 80ghz ml350 g6 , null ne... | description intel xeon x5560 ml350 g6 2 80ghz ... | hp enterprise | None | specifications category proliant processor sub... | {'category': 'proliant processor', 'sub catego... | Computers_and_Accessories | 1679624 | [{'/sku': '[495906b21]'}, {'/mpn': '[495906b21... | ... | description intel xeon x5560 ml350 g6 2 80ghz ... | hp enterprise | usd 213 85 | specifications category proliant processor sub... | {'category': 'proliant processor', 'sub catego... | Computers_and_Accessories | 1679624 | [{'/mpn': '[495906b21]'}] | 1 | 16876009#16248399 |
| 1 | 5490217 | hp intel xeon x5560 prijzen tweakers | None | None | None | categorie processors merk hp product hp intel ... | {'categorie': 'processors', 'merk': 'hp', 'pro... | Computers_and_Accessories | 1679624 | [{'/mpn': '[495906b21]'}, {'/gtin13': '[884420... | ... | description intel xeon x5560 ml350 g6 2 80ghz ... | hp enterprise | usd 213 85 | specifications category proliant processor sub... | {'category': 'proliant processor', 'sub catego... | Computers_and_Accessories | 1679624 | [{'/mpn': '[495906b21]'}] | 1 | 5490217#16248399 |
| 2 | 16543750 | asus motherboard lga2066 ddr4 m 2 u atx 2xgbe ... | None | None | None | None | None | Computers_and_Accessories | 109916 | [{'/sku': '[34181655]'}, {'/mpn': '[primex299d... | ... | None | None | None | categorie moederborden merk asus product asus ... | {'categorie': 'moederborden', 'merk': 'asus', ... | Computers_and_Accessories | 109916 | [{'/mpn': '[primex299deluxe, 90mb0ty0m0eay0]'... | 1 | 16543750#16721450 |
| 3 | 16721450 | asus prime x299 deluxe prijzen tweakers | None | None | None | categorie moederborden merk asus product asus ... | {'categorie': 'moederborden', 'merk': 'asus', ... | Computers_and_Accessories | 109916 | [{'/mpn': '[primex299deluxe, 90mb0ty0m0eay0]'... | ... | support for x series intel core processors sli... | asus | None | None | None | Computers_and_Accessories | 109916 | [{'/mpn': '[90mb0ty0m0eay0]'}, {'/gtin13': '[4... | 1 | 16721450#10358026 |
| 4 | 10358026 | asus prime x299 deluxe socket 2066 intel atx m... | support for x series intel core processors sli... | asus | None | None | None | Computers_and_Accessories | 109916 | [{'/mpn': '[90mb0ty0m0eay0]'}, {'/gtin13': '[4... | ... | atx quad channel ddr4 3 x pcie 3 0 x16 2 x m 2... | asus | None | None | None | Computers_and_Accessories | 109916 | [{'/productID': '[asux29del]'}, {'/mpn': '[pri... | 1 | 10358026#4588573 |
5 rows × 22 columns
n = df_computers.shape[0]
m = df_computers.shape[1]
print(f'Number of pairs: {n}, number of features: {m}')
Number of pairs: 8094, number of features: 22
print(f'Number of positive pairs: {get_positive_pairs_count(df_computers)} - {round(get_positive_pairs_count(df_computers) * 100 / n, 2)}% of all observations.')
print(f'Number of negative pairs: {get_negative_pairs_count(df_computers)} - {round(get_negative_pairs_count(df_computers) * 100 / n, 2)}% of all observations.')
Number of positive pairs: 6332 - 78.23% of all observations. Number of negative pairs: 1762 - 21.77% of all observations.
plot_positive_vs_negative(df_computers)
We plan to concatenate the features into one text for the input of models. Therefore, missing values are manageable.
missing_values = {}
col = 'title_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: title_left = 0, which is 0.0% of all pairs>
col = 'title_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: title_right = 0, which is 0.0% of all pairs>
col = 'description_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: description_left = 2442, which is 30.17% of all pairs>
col = 'description_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: description_right = 2461, which is 30.41% of all pairs>
col = 'brand_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: brand_left = 3948, which is 48.78% of all pairs>
col = 'brand_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: brand_right = 3996, which is 49.37% of all pairs>
col = 'price_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: price_left = 6520, which is 80.55% of all pairs>
col = 'price_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: price_right = 6653, which is 82.2% of all pairs>
col = 'specTableContent_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: specTableContent_left = 5303, which is 65.52% of all pairs>
col = 'specTableContent_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: specTableContent_right = 5256, which is 64.94% of all pairs>
plot_missing_values(missing_values, n)
Number of missing (for both right and left offer simultaneously) values per feature.
missing_values_simul = {}
col_prefix = 'title'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both title_right and title_left): = 0, which is 0.0% of all pairs>
col_prefix = 'description'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both description_right and description_left): = 954, which is 11.79% of all pairs>
col_prefix = 'brand'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both brand_right and brand_left): = 2575, which is 31.81% of all pairs>
col_prefix = 'price'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both price_right and price_left): = 5509, which is 68.06% of all pairs>
col_prefix = 'specTableContent'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both specTableContent_right and specTableContent_left): = 4074, which is 50.33% of all pairs>
plot_missing_values_simultaneously(missing_values_simul, n)
avg_lens = {}
avg_lens_pos = {}
avg_lens_neg = {}
col = 'title_right'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_right for - all pairs = 69.34655300222387, pos. pairs = 69.33828174352496, neg. pairs = 69.37627695800226
col = 'title_left'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_left for - all pairs = 68.4372374598468, pos. pairs = 68.50236891977258, neg. pairs = 68.20317820658343
col = 'description_left'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_left for - all pairs = 241.86940943909067, pos. pairs = 242.86465571699304, neg. pairs = 238.2928490351873
col = 'description_right'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_right for - all pairs = 215.88448233259206, pos. pairs = 216.29153506001262, neg. pairs = 214.4216799091941
plot_avg_lengths(avg_lens, avg_lens_pos, avg_lens_neg, n)
import plotly
## Preserve plotly functionality in a .html file
plotly.offline.init_notebook_mode()